Abstract



A model for constrained computerized adaptive testing is proposed in which the information 
in the test at the ability estimate is maximized subject to a large variety of possible constraints 
on the contents of the test. At each item-selection step, a full test is first assembled to have 
maximum information at the current ability estimate fixing the items previously administered. 
Then the item with maximum information is selected from the test. All test assembly is 
optimal due to the use of a linear programming model which is automatically updated to 
allow for the attributes of the items already administered as well as the new value of the 
ability estimator. A simulation study using a pool of 753 items from the LSAT showed that 
for adaptive tests of realistic lengths the ability estimator did not suffer any loss of efficiency 
from the presence of 433 constraints on the item selection process. 
A Model for Optimal Constrained Adaptive Testing

The concept of adapting the difficulty of the test to the ability of the individual 
examinee is as old as the first intelligence test (Binet & Simon, 1905). In the Binet-Simon 
test, the items varied according to age group and the examiner was instructed to infer the next 
age group from the responses of the examinee to the previous test items until the true age 
group could be identified with sufficient certainty. In doing so, Binet and Simon intuitively 
followed the statistical principle that the information provided by test items is maximal if 
their difficulty matches the level of ability of the examinee. 

Since modern group-based testing was introduced, attempts have been made to 
implement this principle of adaptivity in a practical format. One of the first attempts was 
two-stage testing, a testing format in which the score on a routing test directs the examinee to one 
of a limited number of measurement tests. In the self-scoring flexilevel test, a testing format 
proposed by Lord (1980, chap. 8), the examinee scores his/her own responses by scratching 
an answer sheet and is instructed to move on to the next item as a function of the correctness 
of the response. In Weiss' (1973) computerized stradaptive test, the items in the pool are 
divided into strata of difficulty and ordered according to their discrimination power within 
each stratum. The examinee moves to the next item in the higher stratum if his/her response is 
correct but to a lower stratum if it is incorrect. For a more extensive description of these early 
forms of adaptive testing, see Wainer (1990) or Weiss (1985). 

With the advent of powerful personal computers and the acceptance of item response 
theory (IRT) as a tool for calibrating item pools, large-scale application of fully computerized 
adaptive testing (CAT) has become possible. A well-known procedure in adaptive testing is 
maximum-information item selection in combination with maximum-likelihood estimation of 
ability. In this paper, it is assumed that the responses to the items in the pool fit the 
three-parameter logistic (3-PL) response model 

$$P_i(\theta) = \text{Prob}\{U_i = 1\} = c_i + (1 - c_i)\left[1 + \exp(-a_i(\theta - b_i))\right]^{-1}, \tag{1}$$

where $\theta \in (-\infty, \infty)$ is a parameter for the ability of the examinee, $b_i \in (-\infty, \infty)$ and $a_i \in [0, \infty)$ 
are parameters for the difficulty and discrimination power of item $i$, respectively, and $c_i$ is the 
guessing parameter of item $i$. For this model Fisher's information on $\theta$ in item $i$ can be shown to be equal to 

$$I_i(\theta) = a_i^2 P_i(\theta) Q_i(\theta), \tag{2}$$

with $Q_i(\theta) = 1 - P_i(\theta)$. The maximum-information principle selects the next item to have a 
maximum value for (2) at the current ability estimate. With a modern PC the time needed to 
calculate the maximum-likelihood ability estimate and select the item with maximum 
information from an item pool of realistic size is hardly noticeable by the examinee. 
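
The maximum-information principle can be illustrated with a minimal sketch. It assumes item parameters stored as (a, b, c) tuples and computes the information exactly as stated in Eq. (2); the function names and the three-item pool are hypothetical and serve only as an illustration, not as the implementation used in this paper.

```python
import math

def p_correct(theta, a, b, c):
    """3-PL response probability as in Eq. (1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b, c):
    """Item information following Eq. (2) as stated in the text: a^2 * P * Q."""
    p = p_correct(theta, a, b, c)
    return a * a * p * (1.0 - p)

def select_item(theta_hat, pool, administered):
    """Index of the not-yet-administered item with maximum information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: information(theta_hat, *pool[i]))

# Hypothetical (a, b, c) parameters, chosen for illustration only.
pool = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.5, 0.2)]
first = select_item(0.0, pool, administered=set())
second = select_item(0.0, pool, administered={first})
```

At an ability estimate of 0, the item whose difficulty matches the estimate and whose discrimination is highest is selected first, which is the intuition behind the principle described above.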

Paradoxically, now that fully computerized CAT is technically possible, the interest 
seems to be moving back to earlier forms of adaptive testing. The reason for this unexpected 
development lies in the fact that the original conception of CAT focuses entirely on the 
statistical aspects of item selection and ability estimation and ignores all other test 
specifications typically in use in testing programs. As a consequence, it may lead to testing 
programs that: 

1. do not guarantee equal composition of tests across examinees, and hence lose 
their face validity; 

2. exclude the use of item pools with dependencies between the items, for 
example, between items that cannot be administered in the same test because 
one item contains a clue to the solution to another item or between items that 
have to be presented in sets because they are linked to a common stimulus; 

3. overexpose some items, with the potential danger that the items become 
known prematurely to the examinees; 

4. do not allow for the possibility of reviewing responses to earlier items, a 
feature some programs want to offer to their examinees. 

Several solutions to these problems have been proposed. Wainer and Kiely (1987) 
suggest adaptive testing from a pool of testlets rather than individual items, designing the 
testlets to ensure adequate content coverage in the individual tests. The same goal is 
addressed in the proposal by Kingsbury and Zara (1991) who suggest spiraling item selection 
along subsets of items in the pool defining relevant content dimensions. Adema (1990) and 
Luecht (1995) use optimization techniques to assemble a system of two-stage tests with each 
possible route meeting the same set of test specifications. Reese and Schnipke (1996) 
combine the ideas of two-stage and testlet-based testing. A probabilistic mechanism to govern 
the exposure rates of items in CAT is presented in Sympson and Hetter (1985). Stocking and 
Swanson (1993) propose a heuristic for sequential item selection that treats the test 
specifications as well as the goal of maximum information as "desirable properties" of the test 
and then compromises between them at each item-selection step. 

It is the purpose of the present paper to propose a new form of constrained CAT. The 
procedure starts with the on-line assembly of a full test that meets all of the specifications and 
has maximum information at an initial estimate of the ability of the examinee. The assembly 
of the test is optimal due to the use of a linear programming (LP) model of the test 
specifications. The first item to be administered is selected from this test according to the 
maximum-information principle. At each next step, the LP model is updated to allow for the 
values of the attributes of the items already administered, and the remaining part of the test is 
reassembled to have maximum information at the new ability estimate. The approach 
improves on conventional multi-stage or testlet-based adaptive testing designs in that there is 
no need to assemble fixed subtests or testlets in advance. All test assembly is on line to ensure 
maximum information at the current ability estimate. At the same time, unlike conventional 
CAT, item selection automatically satisfies the test specifications. The idea to base CAT on a 
process of reassembling full tests was developed independently by Cordova (1996). The 
approach is an alternative to the sequential heuristic proposed by Stocking and Swanson 
(1993); it is more rigorously based on the ideas developed for the application of linear 
programming (LP) to optimal test assembly, does guarantee that all of the test specifications 
are met, and has the explicit objective of maximum information in the test. A discussion of 
the precise differences between existing approaches and the present approach to constrained 
adaptive testing is postponed until the latter has been presented in more detail. 

In the remaining part of the paper, constrained adaptive test assembly is first 
conceptualized as an adaptive solution to an LP model for test assembly. An example of a 
model is given and possible implementations are discussed. For two different 
implementations, the statistical properties of the ability estimator are compared in a 
simulation study using an existing item pool for the Law School Admission Test (LSAT). 

General Model of Constrained Test Assembly 

The concept underlying the following sections is that the process of test assembly can 
be characterized as an instance of constrained optimization. Formally, each constrained 
optimization problem has: (1) an objective function defined on the decision variables of the 
problem which is maximized or minimized; and (2) a series of constraints on the possible 
values of the decision variables which together define a feasible solution to the problem. In 
test assembly, for example, the objective may be to match the test information function to a 
target and the constraints may require that prespecified numbers of items be selected from 
certain content categories. If the objective function and constraints are linear in the decision 
variables, the problem belongs to the domain of linear programming (LP), which has a large 
body of algorithms and heuristics to solve its problems. A large variety of conventional test 
assembly problems have been shown to lend themselves to modeling as an LP problem with 
0-1 decision variables. Some relevant references are: Adema (1992a, 1992b), Adema, 
Boekkooi-Timminga and van der Linden (1991), Adema and van der Linden (1989), 
Armstrong and Jones (1992), Armstrong, Jones and Wu (1992), Boekkooi-Timminga (1987, 
1990), Theunissen (1985, 1986), Timminga and Adema (1995, 1996), van der Linden (1994; 
to appear), van der Linden and Boekkooi-Timminga (1988), and van der Linden and Luecht 
(1996). 

An important distinction in test assembly is the one between constraints on categorical 
and quantitative attributes of test items. Categorical attributes introduce a partitioning of the 
item pool with different subsets of items corresponding to different levels of the attribute. 
Some examples of categorical attributes are: item content, cognitive level, item format, and 
gender orientation. A quantitative attribute is a parameter or coefficient with possibly 
different numerical values for each item. Examples of this type of attribute are: item p-value, 
expected response time, and item exposure rate. Constraints may also be needed to guarantee 
that items linked to the same stimulus are administered as sets. In addition, these stimuli 
themselves may involve constraints on categorical (e.g., content classification) or quantitative 
attributes (e.g., word count). 

The problem of constrained CAT can now be represented as a series of updates of the 
following optimization problem: 

maximize information at current ability estimate            (2)

subject to possible constraint(s) on the

    length of the test;                                      (3)
    number of item sets in the test;                         (4)
    number(s) of items per item set;                         (5)
    categorical item attributes;                             (6)
    quantitative item attributes;                            (7)
    dependencies between items in sets;                      (9)
    categorical item set attributes;                         (10)
    quantitative item set attributes.                        (11)


In addition, a few technical constraints may be necessary to solve the optimization problem. 
The following section gives an example of an LP formulation of this verbally stated problem. 

Example 

To present the example, the following definitions are needed: The items in the pool 
are indexed by $i = 1, \ldots, I$. In addition, the pool is assumed to consist of item sets, $V_j$, $j = 1, \ldots, J$, 
each of which may have a different number of items. For each item a decision variable $x_i$ is 
used which takes the value 1 if the item is included in the test and the value 0 otherwise. 
Likewise, a second decision variable $z_j$ is used to decide whether ($z_j = 1$) or not ($z_j = 0$) item set 
$j$ is included in the test. In addition, the exemplary attributes in Table 1 are used. 

Table 1 

Exemplary Item and Item Set Attributes 

Attribute                           Value
Cognitive Level of Item             Reading Comprehension ($C_1$); Analytic Reasoning ($C_2$);
                                    Logical Reasoning ($C_3$)
Expected Response Time for Item     $r_i \in (0, \infty)$
Frequency of Previous Item Usage    $f_i \in \{0, 1, \ldots\}$
Content of Item Set                 Humanities ($S_1$); Social Sciences ($S_2$)


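The following example of the test assembly problem is given: 

$$\text{maximize} \quad \sum_{i=1}^{I} I_i(\hat{\theta})\, x_i \qquad \text{(maximum information at } \hat{\theta}\text{)} \tag{12}$$

subject to 

$$\sum_{i=1}^{I} x_i = n, \qquad \text{(test length)} \tag{13}$$

$$\sum_{j=1}^{J} z_j = m, \qquad \text{(number of item sets)} \tag{14}$$

$$\sum_{i \in V_j} x_i \le n_j^{(u)} z_j, \quad j = 1, \ldots, J, \qquad \text{(number of items in item set } j\text{)} \tag{15}$$

$$\sum_{i \in V_j} x_i \ge n_j^{(l)} z_j, \quad j = 1, \ldots, J, \qquad \text{(number of items in item set } j\text{)} \tag{16}$$

$$\sum_{i \in C_h} x_i \le n_h^{(u)}, \quad h = 1, 2, 3, \qquad \text{(number of items per cognitive level)} \tag{17}$$

$$\sum_{i \in C_h} x_i \ge n_h^{(l)}, \quad h = 1, 2, 3, \qquad \text{(number of items per cognitive level)} \tag{18}$$

$$\sum_{i=1}^{I} r_i x_i \le r^{(u)}, \qquad \text{(response time available)} \tag{19}$$

$$f_i x_i \le f^{(u)}, \quad i = 1, \ldots, I, \qquad \text{(maximum item exposure)} \tag{20}$$

$$\sum_{j \in S_g} z_j \le n_g^{(u)}, \quad g = 1, 2, \qquad \text{(number of item sets per content category)} \tag{21}$$

$$\sum_{j \in S_g} z_j \ge n_g^{(l)}, \quad g = 1, 2, \qquad \text{(number of item sets per content category)} \tag{22}$$

$$x_{31} + x_{32} + x_{33} + x_{34} \le 1, \qquad \text{(mutually exclusive items)} \tag{23}$$

$$z_8 + z_9 + z_{10} \le 1, \qquad \text{(mutually exclusive item sets)} \tag{24}$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I, \qquad \text{(domain of decision variables)} \tag{25}$$

$$z_j \in \{0, 1\}, \quad j = 1, \ldots, J. \qquad \text{(domain of decision variables)} \tag{26}$$
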

The right-hand side coefficients in the constraints are bounds on numbers of items ($n$) or item 
sets ($m$). Upper and lower bounds are denoted by a corresponding superscript. Note that some 
of the constraints are formulated using the decision variables for the items ($x_i$) and others 
using the variables for the item sets ($z_j$). The constraints in (15)-(16) have both types of 
variables to ensure that individual items in sets are chosen if and only if a sufficient number 
from their sets are chosen. It is evident that the model only has a solution if the numbers in 
the right-hand side coefficients are chosen consistently and the pool has enough items to 
satisfy these numbers. These conditions are assumed to be met in a deliberately designed 
CAT program. 

The model in (12)-(26) is equivalent to the maximin model for test assembly (van der 
Linden and Boekkooi-Timminga, 1989), with the exception that it does not maximize the 
information in the test proportionally at a number of $\theta$ values but at an estimate of $\theta$ for a 
single examinee. A review of the constraints available to model a large variety of test 
specifications is given in the same paper. 

Models for test assembly as in (12)-(26) can be solved for an optimal test (i.e., a set of 
values for the decision variables) using a standard software package for LP or a choice from 
the algorithms and heuristics offered in the test assembly package ConTEST (Timminga, van 
der Linden & Schweizer, 1996). For test assembly models with the special structure of a 
network-flow problem, efficient algorithms are possible (Armstrong, Jones & Wu, 1992). 
Typically, the use of each of these algorithms is preceded by some form of preprocessing of 
the model or the item pool; for example, a solution of a model with a constraint as in (20) is 
generally obtained more quickly if all items with $f_i > f^{(u)}$ are first removed from the pool. 
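
To make the structure of such a 0-1 model concrete, the following minimal sketch assembles a test from a very small hypothetical pool. It is an assumption-laden illustration: instead of an LP package or ConTEST it uses exhaustive enumeration of all n-item subsets, which is only workable for toy pools, and it covers only constraints of the types in (13), (17)-(19), and (23). All names and numbers are invented for the illustration.

```python
from itertools import combinations

def assemble_test(info, n, level, level_bounds, time, time_max, enemies=()):
    """Return the feasible n-item test with maximum summed information.

    Exhaustive enumeration stands in for an LP solver here (toy pools only).
    """
    best_test, best_value = None, float("-inf")
    for test in combinations(range(len(info)), n):      # test length, cf. (13)
        counts = {h: sum(level[i] == h for i in test) for h in level_bounds}
        if any(not lo <= counts[h] <= hi for h, (lo, hi) in level_bounds.items()):
            continue                                    # level bounds, cf. (17)-(18)
        if sum(time[i] for i in test) > time_max:
            continue                                    # response time, cf. (19)
        if any(len(set(test) & set(group)) > 1 for group in enemies):
            continue                                    # exclusive items, cf. (23)
        value = sum(info[i] for i in test)              # objective, cf. (12)
        if value > best_value:
            best_test, best_value = test, value
    return best_test, best_value

# Hypothetical six-item pool; information values are taken at some fixed estimate.
info = [0.9, 0.7, 0.8, 0.5, 0.6, 0.4]
level = ["C1", "C1", "C2", "C2", "C3", "C3"]
bounds = {"C1": (1, 1), "C2": (1, 2), "C3": (1, 1)}
time = [60, 50, 90, 40, 30, 45]

free_test, free_value = assemble_test(info, 3, level, bounds, time, 200)
test, value = assemble_test(info, 3, level, bounds, time, 200, enemies=[(0, 2)])
```

In this toy case, declaring items 0 and 2 mutually exclusive forces the solution away from the most informative combination, which shows how a single constraint can lower the attainable value of the objective function.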

The next section discusses how to implement models as in (12)-(26) in a CAT 
program. 



Adaptive Implementation of the Model 

It is assumed that the test stops as soon as n items are administered. Other stopping 
rules are possible but this rule is believed to enhance the face validity of the test. Adaptive 
implementation of the model in (12)-(26) involves the on-line execution of the following 
steps for each examinee: 

Step 1: Initialize the model; 

Step 2: Assemble an initial test according to the model; 

Step 3: Administer the item with maximum information at the ability estimate; 

Step 4: Update the model; 

Step 5: Reassemble the remaining part of the test putting the items not administered 
back into the pool; 

Step 6: Repeat Steps 3-5 until n items have been administered. 

The algorithm is adaptive because of Step 4. The update of the model in this step 
involves both an update of $\hat{\theta}$ in the objective function in (12) and an update to allow for the 
attributes of the item administered. The only thing needed to perform the latter is to insert a 
constraint into the model that sets the decision variable of this item equal to 1. For example, if 
Item 22 is selected, the constraint $x_{22} = 1$ is inserted. 

Note that when reassembling the remaining part of the test in Step 5, the items not yet 
administered are put back into the pool. Hence, the newly assembled part of the test is always 
at least as good as the old part but most likely better since the ability estimate has been 
updated. Also, if a feasible solution to the model exists for the initial test, the problem of 
reassembling later parts of the test remains feasible. 
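
The steps above can be sketched as a short loop. This is a minimal illustration under several assumptions: a brute-force search over subsets replaces the LP solver, the test specifications are passed in as a hypothetical `feasible` predicate, ability is re-estimated with a grid-based EAP estimator with a uniform prior on [-4, 4], and the pool, the content categories, and the deterministic examinee are invented.

```python
import math
from itertools import combinations

def prob(theta, a, b, c):
    """3-PL response probability, Eq. (1)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def info(theta, item):
    """Item information as stated in Eq. (2)."""
    p = prob(theta, *item)
    return item[0] ** 2 * p * (1 - p)

def eap(pool, responses):
    """EAP estimate on a grid with a uniform prior (an assumed estimator)."""
    grid = [-4 + 0.1 * k for k in range(81)]
    like = [math.prod(prob(t, *pool[i]) if u else 1 - prob(t, *pool[i])
                      for i, u in responses.items()) for t in grid]
    return sum(t * w for t, w in zip(grid, like)) / sum(like)

def assemble(pool, theta, n, feasible, fixed):
    """Best feasible n-item test containing all fixed items (x_i = 1)."""
    free = [i for i in range(len(pool)) if i not in fixed]
    tests = (tuple(fixed) + extra for extra in combinations(free, n - len(fixed)))
    tests = [t for t in tests if feasible(t)]
    if not tests:
        raise ValueError("no feasible test")
    return max(tests, key=lambda t: sum(info(theta, pool[i]) for i in t))

def constrained_cat(pool, n, feasible, answer, theta0=0.0):
    theta, fixed, responses = theta0, [], {}          # Step 1
    while len(fixed) < n:                             # Step 6
        test = assemble(pool, theta, n, feasible, fixed)   # Steps 2 and 5
        free = [i for i in test if i not in fixed]
        nxt = max(free, key=lambda i: info(theta, pool[i]))  # Step 3
        responses[nxt] = answer(nxt)
        fixed.append(nxt)                             # Step 4: fix x_nxt = 1
        theta = eap(pool, responses)                  # Step 4: update estimate
    return fixed, theta

# Hypothetical pool, content categories, and examinee.
pool = [(1.0, -1.5, 0.2), (1.2, -0.5, 0.2), (0.9, 0.0, 0.2),
        (1.1, 0.5, 0.2), (1.3, 1.0, 0.2), (0.8, 1.5, 0.2)]
category = ["A", "A", "A", "B", "B", "B"]

def feasible(test):
    """At least one and at most two items from category A."""
    return 1 <= sum(category[i] == "A" for i in test) <= 2

def answer(i):
    """Deterministic examinee: correct on items with difficulty below 0.25."""
    return 1 if pool[i][1] < 0.25 else 0

administered, theta_hat = constrained_cat(pool, 3, feasible, answer)
```

Because every administered item is taken from a feasible full test and is then fixed in the next assembly, the administered set always remains part of some feasible test, which is the property noted in the preceding paragraph.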

In Table 2 the algorithm is illustrated for a 5-item test. The items in the upper triangle 
are the items already administered. The items in the lower triangle form the part of the test 
reassembled using the updated model (Step 5). The bold numbers in this triangle are the items 
selected according to the maximum-information principle. Note that bold numbers are moved 
to the upper triangle in the next column of the table. 

Table 2 

Example of a 5-item constrained adaptive test 

              Selection of Item
    #1      #2      #3      #4      #5
    --      39      39      39      39
    13      --      14      14      14
    27       8      --      41      41
    28      14*     22      --      22
    39*     41      37      22*     --
    41      49      41*     37       6*

Note. Numbers in upper triangle are items already administered. Italic numbers in lower 
triangle are items in the reassembled part of the test. Bold numbers, marked here with an 
asterisk, are items selected according to the maximum-information principle. 

Possible initializations of the model. 

How the model should be initialized in Step 1 has not yet been explained. An obvious 
way to do so is to choose a plausible value for $\theta$ based on knowledge of the ability 
distribution of the population of examinees and to choose the values for the bounds in the 
constraints on the basis of the test specifications. A more sophisticated initialization of $\theta$ is to 
choose a value based on prior information on the values of relevant background variables for the 
examinee. A method for estimating $\theta$ directly from background variables is presented in van 
der Linden (submitted). An alternative is to choose a prior value for $\theta$ and administer a short 
CAT as a pretest, ignoring the constraints in the model. The suggestion is based on the 
observation that the presence of large numbers of constraints in the test assembly models may 
slow down the convergence of the ability estimator. Therefore, it may be advantageous to relax 
the algorithm first and impose the constraints on the item selection process when the ability 
estimator has had some time to stabilize. Stabilization has been shown to be remarkably quick 
for a Bayesian alternative to the maximum-information principle of item selection known as the 
Maximum Predicted Posterior Expected Information Criterion (van der Linden, 1996). If the 
constraints are introduced at a later moment in the test, the decision variables of the items 
already administered have to be fixed at 1. Of course, to keep the original model feasible, the 
pretest cannot be longer than the smallest upper bound in the right-hand sides of the constraints 
on item numbers in the model. 

Item Sets and Item Review 

The presence of item sets in the pool entails no special measures as long as the 
structure of the pool has been modeled correctly by constraints such as those in (14)-(15), 
(21)-(22), and (24) in the exemplary model. If an item set is chosen, an optimal number of 
items in the set between the given bounds is also chosen. Normally, item sets are to be 
administered intact. If so, Steps 4 and 5 in the algorithm are postponed until the last item in the 
set has been administered. In the (unlikely) case that the items need not be administered as an 
intact set, the procedure can just be continued and the algorithm automatically selects the 
right number of items from the set at optimal moments. 

If the examinees are given the opportunity to review their responses within blocks of 
items, the only possible consequence is a revision of the ability estimate if some of the 
responses are changed. Thus, when moving to a next block, $\hat{\theta}$ may have to be revised but the 
set of constraints in the model need not be updated. 

Statistical Properties of the Ability Estimator 

To study the effect of constraints in the adaptive item selection process on the ability 
estimator for a realistic adaptive testing program, a simulation study was run using a pool of 
753 items from the LSAT. The pool consisted of three different sections, which are labeled 
here as SA, SB, and IA. All items were calibrated using the 3-PL model given in (1). The 
length of the adaptive test was set equal to n=50, with the following distribution of items 
across sections: SA: 12 items; SB: 14 items; and IA: 24 items. Large numbers of linear 
constraints were imposed on the item selection process to deal with the item-set structure of 
the pool as well as existing specifications with respect to item (sub)types, types of stimuli in 
item sets, gender and minority orientation of the stimuli, answer key distributions, and word 
counts. The numbers of decision variables and constraints in the model for the complete test 
as well as its three sections are given in Table 3. 



Table 3 

Numbers of items, item sets, decision variables, and constraints in the model 

Level    #Items    #Item Sets    #Variables    #Constraints
Test     753       3             804           433
SA       208       24            232           179
SB       240       24            264           218
IA       305       0             305           30

Figure 1. Estimated MSE functions of the EAP estimator after 10, 20, 30, 40, and 50 
items (solid line: unconstrained CAT; dashed line: constrained CAT, in the order 
IA, SA, SB; dotted line: constrained CAT, in the order SA, SB, IA). 

The following three different conditions were simulated: 

1. Constrained CAT, with sections in the order IA, SA, SB; 

2. Constrained CAT, with sections in the order SA, SB, IA; 

3. Unconstrained CAT. 

Because Section IA was least severely constrained, a comparison between the results for the 
first two conditions shows the effect of imposing the majority of the constraints after the 
ability estimator is stabilized. The comparison between the first two and the last condition 
shows the effect of the 433 constraints on the ability estimator. 

Adaptive tests were simulated for $\theta = -2.0, -1.5, \ldots, 2.0$, and the procedure was replicated 
100 times for each $\theta$ value. Ability was estimated using the EAP estimator with a uniform prior 
distribution. The initial ability estimate was set equal to 0. At each step the LP model was 
solved using the First Acceptable Integer Solution Algorithm (Adema, 1992b; Timminga, van 
der Linden & Schweizer, 1996, sect. 6.6). This heuristic is based on the following adaptation of 
the branch-and-bound method. Let $z_{LP}$ be the value of the objective function in the solution to 
the relaxed model. This value serves as an upper bound to the solution of the model with 0-1 
variables. The branch-and-bound search is stopped as soon as the current solution is larger 
than $h_1 z_{LP}$, with $h_1 < 1$ but large enough to guarantee a satisfactory result. In addition, following 
Crowder, Johnson, and Padberg (1983), the optimal reduced costs in the relaxed solution are 
used to fix some of the nonbasic variables. Let $d_i$ be the costs associated with nonbasic 
variable $x_i$. Then, if $x_i = 0$ in the relaxed solution and $z_{LP} - h_2 z_{LP} < d_i$, $h_2 < 1$, the variable is fixed 
to 0. Likewise, $x_i$ is fixed to 1 if $x_i = 1$ in this solution and $z_{LP} - h_2 z_{LP} < -d_i$. For the LP models in 
the present example, the best setting found was $h_1 = .90$ and $h_2 = .91$. Parameter $h_2$ has to be set 
larger than $h_1$, but if it is set too high, overconstraining may occur. In manual test assembly, 
the heuristic is then rerun with a lower value for this parameter. In the current framework of 
adaptive testing, however, it was decided not to reassemble the test and to select the next item 
simply from the last test assembled. The effect of this measure, which was applied for 4.06% 
of all items selected in this study, is possibly less than optimal item selection and hence 
underestimation of the efficiency of the ability estimator. The results from the comparison 
between the mean-squared error (MSE) of the ability estimator in the constrained and 
unconstrained adaptive modes presented below are therefore expected to be slightly 
conservative with respect to the former. 
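
The two rules of the heuristic described above, the early stopping of the branch-and-bound search and the fixing of nonbasic variables by their reduced costs, can be written down directly. The sketch below is a literal transcription of the inequalities in the text, with the sign convention for the reduced costs taken as given there; the function names and the numerical inputs are hypothetical, and no actual branch-and-bound solver is included.

```python
def stop_search(incumbent_value, z_lp, h1):
    """Stop the search once the current integer solution exceeds h1 * z_LP."""
    return incumbent_value > h1 * z_lp

def fix_nonbasic(z_lp, h2, relaxed_values, reduced_costs):
    """Return {index: value} for variables fixed by the reduced-cost rule."""
    gap = z_lp - h2 * z_lp
    fixed = {}
    for i, (x, d) in enumerate(zip(relaxed_values, reduced_costs)):
        if x == 0 and gap < d:
            fixed[i] = 0      # variable at 0 in the relaxed solution stays at 0
        elif x == 1 and gap < -d:
            fixed[i] = 1      # variable at 1 in the relaxed solution stays at 1
    return fixed

# Hypothetical relaxed solution with objective value 10 and five variables.
z_lp = 10.0
relaxed_values = [0, 0, 1, 1, 0.5]
reduced_costs = [1.5, 0.2, -2.0, -0.3, 0.0]

fixed = fix_nonbasic(z_lp, 0.91, relaxed_values, reduced_costs)
```

With these invented numbers, only the two variables whose reduced costs exceed the gap in absolute value are fixed; the fractional variable and the variables with small reduced costs remain free for the search.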

Figure 2. Estimated bias functions of the EAP estimator of ability after 10, 20, 30, 
40, and 50 items (solid line: unconstrained CAT; dashed line: constrained CAT, in 
the order IA, SA, SB; dotted line: constrained CAT, in the order SA, SB, IA). 

All runs were made on a PC with a Pentium 133 MHz processor. The CPU times needed 
to select an item in the constrained mode, that is, to update the model, reassemble the test, and select 
an item with maximum information from it, were all within 1-2 seconds. These figures show that 
the approach proposed in this paper is practically feasible for item pools and test specifications 
such as those used in this example. 

The MSE functions of the EAP ability estimator after n=10, 20, 30, 40, and 50 items are 
presented in Figure 1. For n=10, the functions for Condition 1 (constrained CAT, with order: 
IA, SA, SB) and Condition 3 (unconstrained CAT) show about equal results for all values of 
$\theta$. The function for Condition 2 reveals relatively poor performance for the CAT version with a 
more severely constrained section at the beginning of the test. However, the effect is already 
small when 20 items are administered, and for more than 30 items the results for the three 
conditions are identical for all practical purposes. The bias functions in Figure 2 show the same 
pattern. Note that in both figures the results for the lower end 
of the $\theta$ scale tend to be somewhat poorer than those for the upper end. This difference in 
performance is likely to be due to underrepresentation of some categories of items at the lower 
end of the scale in the item pool. 
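
For reference, the MSE and bias at a given true ability value are simple averages over the replications. The sketch below assumes the standard definitions (mean squared and mean signed estimation error); the estimates are invented numbers and do not come from the simulation study.

```python
def mse_and_bias(estimates, theta_true):
    """Mean-squared error and bias of ability estimates at one true theta."""
    errors = [e - theta_true for e in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(err * err for err in errors) / len(errors)
    return mse, bias

# Hypothetical estimates from four replications at theta = 1.0.
mse, bias = mse_and_bias([0.8, 1.1, 1.3, 0.9], theta_true=1.0)
```

Repeating this calculation for each simulated ability value and each test length gives functions of the kind plotted in Figures 1 and 2.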



Discussion 



As already observed, other adaptive testing formats that can be used to deal with 
constraints on test contents are multi-stage and testlet-based adaptive testing. In multi-stage 
testing, the content of the test is adapted only at the end of previously determined stages. In 
addition, at each stage only a limited number of options is available, each designed to be 
optimal for a previously selected ability level. In contrast, the present format adapts the 
content of the test to the updated ability estimate after each new item, selects the remaining 
part of the test from all options feasible for the item pool, and guarantees maximum 
information. Testlet-based adaptive testing offers more flexibility than multi-stage testing but 
in principle the same differences hold. In the Stocking and Swanson (1993) approach, all test 
specifications and the objective of maximal information are combined into a weighted 
objective function. Next, the items are selected from the pool to optimize this function in a 
sequential mode. Applying the approach to the empirical example in this paper, weights 
would have to be specified to reflect the desirability of each of the 433 constraints in the 
model. As a consequence of this complexity, unpredictable violations of the constraints as 
well as the principle of maximum information may occur. The approach in this paper, 
however, requires all constraints to be met. In addition, it is not based on sequential selection 
of single items but at each step selects all remaining items simultaneously to have maximum 
information at the ability estimate. 